0.0.1 Load packages and data

library(tidyverse) 
library(dsbox)
library(mosaicData) 
library(ggplot2)
library(ggsci)
library(maps)
library(rayshader)
library(plotly)

0.0.2 Faculty

staff <- read_csv("data/instructional-staff.csv")
## Rows: 5 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (1): faculty_type
## dbl (11): 1975, 1989, 1993, 1995, 1999, 2001, 2003, 2005, 2007, 2009, 2011
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
staff
## # A tibble: 5 × 12
##   facult…¹ `1975` `1989` `1993` `1995` `1999` `2001` `2003` `2005` `2007` `2009`
##   <chr>     <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
## 1 Full-Ti…   29     27.6   25     24.8   21.8   20.3   19.3   17.8   17.2   16.8
## 2 Full-Ti…   16.1   11.4   10.2    9.6    8.9    9.2    8.8    8.2    8      7.6
## 3 Full-Ti…   10.3   14.1   13.6   13.6   15.2   15.5   15     14.8   14.9   15.1
## 4 Part-Ti…   24     30.4   33.1   33.2   35.5   36     37     39.3   40.5   41.1
## 5 Graduat…   20.5   16.5   18.1   18.8   18.7   19     20     19.9   19.5   19.4
## # … with 1 more variable: `2011` <dbl>, and abbreviated variable name
## #   ¹​faculty_type
staff_long <- staff %>%
  pivot_longer(cols = -faculty_type, names_to = "year") %>%
  mutate(value = as.numeric(value))

staff_long
## # A tibble: 55 × 3
##    faculty_type              year  value
##    <chr>                     <chr> <dbl>
##  1 Full-Time Tenured Faculty 1975   29  
##  2 Full-Time Tenured Faculty 1989   27.6
##  3 Full-Time Tenured Faculty 1993   25  
##  4 Full-Time Tenured Faculty 1995   24.8
##  5 Full-Time Tenured Faculty 1999   21.8
##  6 Full-Time Tenured Faculty 2001   20.3
##  7 Full-Time Tenured Faculty 2003   19.3
##  8 Full-Time Tenured Faculty 2005   17.8
##  9 Full-Time Tenured Faculty 2007   17.2
## 10 Full-Time Tenured Faculty 2009   16.8
## # … with 45 more rows
staff_long %>%
  ggplot(aes(x = year,
             y = value,
             group = faculty_type,
             color = faculty_type)) +
  geom_line(linewidth = 3, lineend = "round") +
  scale_color_npg() +
  # The values sum to roughly 100 within each year, so they are percentages, not head counts
  labs(title = "Instructional Staff Employment Trends",
       x = "Year",
       y = "Percentage of Instructional Staff",
       color = "Faculty Type") +
  theme(panel.background = element_rect(fill = "white", colour = "grey50"))

If there are 5 faculty types and 11 years of data, how many rows would we have? 5 × 11 = 55 rows, which matches the 55 × 3 tibble printed above.

What changes would you propose making to this plot to tell this story?

I would facet the plot by employment category (full-time, part-time, and graduate students) so that each trend can be read on its own panel.
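
The faceting idea can be sketched as follows. This is only an illustration under stated assumptions: `staff_demo` is a hypothetical stand-in for `staff_long` that holds three of the five faculty types and two of the eleven years, with values copied from the printed table. Two of the labels are placeholders because the full names are truncated in the printed output, and the `employment` column and its category names are made up for the example.

```r
library(dplyr)
library(ggplot2)

# Hypothetical stand-in for staff_long (values copied from the printed table).
# The "placeholder label" names are NOT the real faculty_type values.
staff_demo <- data.frame(
  faculty_type = rep(c("Full-Time Tenured Faculty",
                       "Part-Time (placeholder label)",
                       "Graduate (placeholder label)"), each = 2),
  year  = rep(c(1975, 2009), times = 3),
  value = c(29, 16.8, 24, 41.1, 20.5, 19.4)
)

# Derive a coarse employment category from the start of each label
staff_grouped <- staff_demo %>%
  mutate(employment = case_when(
    startsWith(faculty_type, "Full-Time") ~ "Full-time",
    startsWith(faculty_type, "Part-Time") ~ "Part-time",
    TRUE                                  ~ "Graduate students"
  ))

# One panel per employment category, one line per faculty type
facet_plot <- ggplot(staff_grouped,
                     aes(x = year, y = value,
                         group = faculty_type, color = faculty_type)) +
  geom_line() +
  facet_wrap(~ employment) +
  labs(x = "Year", y = "Percentage of instructional staff",
       color = "Faculty type")

facet_plot
```

With the real data, the same `mutate()` step could be applied to `staff_long` directly, provided the actual labels begin with the prefixes assumed here.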

0.0.3 Fisheries

fisheries <- read_csv("data/fisheries.csv")
## Rows: 216 Columns: 4
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): country
## dbl (3): capture, aquaculture, total
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#maps::map('world',col="grey", fill=TRUE, bg="white", lwd=0.05, mar=rep(0,4),border=0, ylim=c(-80,80))

#world <- maps::map(world)

# map_data() already returns a data frame, so no fortify() step is needed
world_map <- map_data("world")

# Join onto the map rows so polygon vertices keep their drawing order;
# regions whose name has no match in fisheries get NA for capture
merged_data <- world_map %>%
  left_join(fisheries, by = c("region" = "country"))

flaw <- ggplot(merged_data, aes(x = long, y = lat)) +
  geom_polygon(aes(group = group, fill = capture)) +
  scale_fill_gradient(low = "#4DBBD5FF", high = "#E64B35FF") +
  theme_void()


#plot_gg(flaw, multicore = TRUE, width = 7, height = 5, zoom = 0.75, phi = 30, theta = 45, backgroud = "white", shadow_intensity = -1)

ggplotly(flaw)

The goal is to show capture, aquaculture, and total fisheries by country. The problem is that there are 216 countries. I want to split

0.0.4 Smokers in Whickham

data(Whickham)


ggplot(data = Whickham,
       mapping = aes(x = age, y = outcome, color = smoker)) +
  geom_point() +
  scale_color_npg(alpha = 0.5) +
  labs(x = "Age", y = "Outcome", color = "Smoker") +
  theme(panel.background = element_rect(fill = "white", colour = "grey50"))

Smoke <- Whickham %>%
  count(smoker, outcome)


# I could not get a function-based percentage calculation to work,
# so I computed the death rates by hand from the counts in Smoke

smoker_death_rate <- 139/(443+139) *100
nonsmoker_death_rate <- 230/(502+230) *100

smoker_death_rate
## [1] 23.88316
nonsmoker_death_rate 
## [1] 31.42077
ggplot(Smoke, mapping = aes(y = n, x = outcome, fill = outcome)) +
  facet_wrap(~ smoker) +
  geom_col() +
  scale_fill_jama() +
  ggtitle("Non-Smoker vs. Smoker Outcome Comparison") +
  theme(panel.background = element_rect(fill = "white", colour = "grey50"),
        strip.background = element_rect(fill = "white"),
        strip.text = element_text(size = 10, face = "bold"))
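
The hand-computed death rates above can also be obtained with grouped arithmetic in dplyr. The following is a minimal sketch rather than the method used in this analysis: `smoke_counts` is a hypothetical data frame that re-enters the four counts from the hand calculation so the example runs on its own, and `pct` and `death_rates` are names chosen for illustration.

```r
library(dplyr)

# Counts re-entered from the hand calculation (smoker x outcome);
# in the analysis itself these come from count(smoker, outcome)
smoke_counts <- data.frame(
  smoker  = c("No", "No", "Yes", "Yes"),
  outcome = c("Alive", "Dead", "Alive", "Dead"),
  n       = c(502, 230, 443, 139)
)

# Within each smoking group, divide each count by the group total
death_rates <- smoke_counts %>%
  group_by(smoker) %>%
  mutate(pct = n / sum(n) * 100) %>%
  ungroup() %>%
  filter(outcome == "Dead")

death_rates
# pct is about 31.42 for nonsmokers and 23.88 for smokers,
# matching the hand calculation
```

Grouping by `smoker` before calling `sum(n)` is what makes the denominator the group total rather than the total for the whole table.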

age_groups <- cut(Whickham$age, breaks = c(17, 44, 64, Inf), labels = c("18-44", "45-64", "65+"))


Whickham1 <- Whickham %>%
  mutate(age_cat = age_groups)

sm <- Whickham1 %>%
  count(smoker, age_cat, outcome)

ggplot(sm, mapping = aes(y = n, x = outcome, fill = smoker)) +
  facet_wrap(~ age_cat) +
  geom_col(position = "dodge") +
  scale_fill_jama() +
  ggtitle("Non-Smoker vs. Smoker Outcome Comparison by Age Groups") +
  theme(panel.background = element_rect(fill = "white", colour = "grey50"),
        strip.background = element_rect(fill = "white"),
        strip.text = element_text(size = 10, face = "bold"))

What type of study do you think these data come from: observational or experiment? Why? Observational, because researchers cannot assign people to be smokers or nonsmokers.

How many observations are in this dataset? What does each observation represent? 1314 observations; each observation is a person.

How many variables are in this dataset? What type of variable is each? Display each variable using an appropriate visualization. 3 variables: outcome (dichotomous), smoker (dichotomous), and age (continuous).

What would you expect the relationship between smoking status and health outcome to be? I would expect smoking to be associated with worse outcomes, meaning that smokers are more likely to die.

Smokers appear more likely to die in the 18-44 and 45-64 age groups, but not in the 65+ age group. Among people aged 65 and over, many more nonsmokers than smokers died, which explains why nonsmokers had the higher overall death rate in the previous graph. Because this is an observational study and not an experiment, other factors such as aging and comorbidities may account for the deaths among nonsmokers aged 65+. This also shows why it is important to analyze data by subgroup.
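
The age-group comparison is based on bar heights, which are counts. A rate per group makes the comparison less dependent on group size. The sketch below is one possible way to compute it, assuming the `Whickham` data from mosaicData and the same age breaks used above; the column names `people`, `deaths`, and `death_rate` are chosen for illustration.

```r
library(dplyr)
library(mosaicData)

data(Whickham)

# Death rate (in percent) for every age group x smoking status combination
rates_by_age <- Whickham %>%
  mutate(age_cat = cut(age,
                       breaks = c(17, 44, 64, Inf),
                       labels = c("18-44", "45-64", "65+"))) %>%
  group_by(age_cat, smoker) %>%
  summarise(people     = n(),
            deaths     = sum(outcome == "Dead"),
            death_rate = deaths / people * 100,
            .groups    = "drop")

rates_by_age
```

The `people` column is worth reading alongside `death_rate`, since it shows how the smokers and nonsmokers are distributed across the age groups.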